ABSTRACT

For classical norm-referenced test reliability, Cronbach's alpha has been shown to be equal to the mean of all possible split-half Pearson product-moment correlation coefficients, adjusted by the Spearman-Brown prophecy formula. For criterion-referenced test reliability, in an analogous vein, this paper provides the rationale behind, the analysis of, computational formulas for, and characteristics of a coefficient equal to the mean of all possible split-half coefficients of agreement. In addition, the relation of this coefficient to other test indices, including those of Harris and Livingston, is presented. (Author)
In the past decade, an increased acceptance of the interrelated notions of behavioral objectives, individualized instruction, and mastery learning has given rise to new kinds of educational tests. One of these new kinds of tests has as its purpose the efficient separation of the sample of examinees into two groups, often labeled "Nonmastery" and "Mastery." When there are only two courses of action available to an examinee after this kind of test is taken (i.e., stay in that instructional module, or go on to studying the next behavioral objective), these two "scores" are the only two that need be reported. There is no purpose served in further subdivision of the test scores; the dichotomy is necessary and sufficient. Such a test, composed of several items drawn from a well-defined universe, measuring a single, narrow behavioral objective, and resulting in a dichotomous classification with reference to a predetermined criterion level, is called a criterion-referenced test (CRT) in this paper.*

The differences between a CRT and the more familiar norm-referenced test (NRT) have implications for psychometric theory. One of these is the fact that the variance of the scores obtained using a CRT need not be large. Also among these are the notions that if true score is considered dichotomous, then misclassification is the primary kind of measurement error associated with a CRT, and that certain other axioms on which traditional reliability is based are not satisfied. In sum, the purpose, desired score distributions, construction, outcomes, and mathematical underpinnings of reliability for CRTs are not necessarily the same as for NRTs. Moreover, the meanings of reliability are, or should be, different. Whereas an NRT is reliable insofar as an examinee receives the same score on two parallel sets of data, a CRT should be reliable insofar as the examinee receives the same dichotomous categorization from the two data sets. But if classical reliability estimates are inappropriate for CRTs, what should take their place?

*It is recognized that there are a number of writers who, with varying degrees of vehemence, would disagree with this definition of a CRT. "CRT" is merely used as a label for the kind of test described above.

A number of authors (Berger, 1973; Carver, 1970; Goodman & Kruskal, 1954; Hambleton & Novick, 1973) have suggested using a rather simple dual-administration (test-retest or parallel forms) coefficient for CRT reliability. This index is frequently called the coefficient of agreement*, and is the proportion of examinees classified similarly on the two test administrations. If + and - stand for the two classifications into which the examinees are dichotomized and the following four-fold contingency table represents the results from the two test administrations:

            +     -
      +     A     B
      -     C     D
                        N

then the coefficient of agreement (here labeled P for simplicity) is

(1)   $P = \frac{A + D}{N}.$
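As a minimal sketch of this index, the proportion (A + D)/N can be computed directly from two lists of classifications. The function name and the five-examinee data below are hypothetical and serve only as an illustration.

```python
def coefficient_of_agreement(first, second):
    """Proportion of examinees classified the same way on both administrations.

    `first` and `second` are equal-length sequences of booleans
    (True = mastery "+", False = nonmastery "-"), one entry per examinee.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty, equal-length classification lists")
    a = sum(1 for x, y in zip(first, second) if x and y)          # cell A: +/+
    d = sum(1 for x, y in zip(first, second) if not x and not y)  # cell D: -/-
    return (a + d) / len(first)


# Hypothetical data: five examinees, two administrations.
admin_1 = [True, True, False, False, True]
admin_2 = [True, False, False, False, True]
print(coefficient_of_agreement(admin_1, admin_2))  # 4 of the 5 examinees agree
```

Cells B and C never enter the computation, which is why only A and D need to be counted.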

But this coefficient is for two test administrations, requiring either a retest or parallel forms. Can this same framework (proportion of consistent classifications) yield a CRT analogue to the familiar single-administration index of internal consistency? A coefficient of agreement calculated from splitting the test into halves would be subject to the same criticism as were split-half methods with classical reliability coefficients a few decades ago, i.e., the test split chosen might yield unrepresentative results. However, a lead is suggested by the fact that the classical internal consistency index was shown (Cronbach, 1951) to be equal to the mean of all possible split-half reliability coefficients. To make an analogy with Cronbach's alpha, then, it would seem fruitful to consider an index equal to the mean of all possible split-half coefficients of agreement. To extend the analogy further, this index is labeled coefficient beta (β).

*Cohen's kappa (Cohen, 1960) has also been called the "coefficient of agreement" (Swaminathan et al., 1974). The indices are related, but they are not the same.

The coefficient

There are $v = \binom{n}{n/2}$ possible test splits for an n-item test (where n is even) if each half is considered to be labeled (i.e., for a two-item test the split 1 / 2 is different from 2 / 1). If β is the mean of all possible split-half coefficients of agreement, then from (1),

$\beta = \frac{1}{v}\sum_{s=1}^{v} P_s = \frac{1}{v}\sum_{s=1}^{v}\frac{A_s + D_s}{N},$

which can be rewritten

(2)   $\beta = \frac{1}{N}\cdot\frac{\sum_{s=1}^{v}(A_s + D_s)}{v}.$

Thus β is also the average (over persons) proportion (over test splits) of consistent classifications (+,+ or -,-).

It is shown in the appendix of this paper that for any person, the proportion of test splits which yield consistent classifications is a function of that person's total score, for a given number of items and criterion level. For instance, for a 20-item test and a criterion level of 80%, a person with a score of 7 or lower will be classified consistently (nonmastery/nonmastery) on all test splits, since a score of at least 8 is needed to achieve mastery on a half-test. Likewise, a person with a score of 18 or more will be classified similarly (mastery/mastery) for all test splits. Persons with scores of from 8 to 17, however, are classified consistently for some test splits and not for others; for example, a person with a score of 12 will be classified similarly if the test split yields half-test scores of, say, 5 and 7, but not for splits which yield, say, 9 and 3.
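The 20-item illustration can be checked with a short sketch that counts, for one total score, the labeled half-splits on which both halves receive the same classification. The function name is hypothetical; the counting rule (choose which of the person's correct items fall in the first half) is an assumption that follows the combinatorial argument given in the appendix.

```python
from math import comb

def consistent_proportion(x, n, k):
    """Proportion of labeled half-splits on which a person with total score x
    gets the same classification on both halves (each half has n // 2 items,
    and k correct answers on a half are needed for mastery)."""
    half = n // 2
    agree = 0
    for j in range(max(0, x - half), min(x, half) + 1):  # j = score on first half
        ways = comb(x, j) * comb(n - x, half - j)         # splits giving (j, x - j)
        if (j >= k) == (x - j >= k):                      # same class on both halves
            agree += ways
    return agree / comb(n, half)

# 20 items, 80% criterion: k = 8 of the 10 items on each half.
for score in (7, 12, 15, 18):
    print(score, round(consistent_proportion(score, 20, 8), 3))
```

Scores of 7 and 18 give a proportion of 1, a score of 15 (one below twice the half-test cut-off) gives 0, and a score of 12 falls strictly between.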

The computing formula for coefficient β is

(3)   $\beta = \frac{1}{N}\left[\sum_{X=0}^{k-1} f_X + \sum_{X=k}^{2k-2} f_X\,\phi_X(X-[k-1],\,k-1) + \sum_{X=2k}^{n/2+k-1} f_X\,\phi_X(k,\,X-k) + \sum_{X=n/2+k}^{n} f_X\right]$

where N = the number of persons,
X = a person's total score,
$f_X$ = the frequency of score X in the distribution of total scores,
k = the minimum number of items on either half-test that must be answered correctly to achieve a "mastery" classification on that half-test,
n = the number of items, and

$\phi_X(a,b) = \sum_{j=a}^{b}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2},$

i.e. the proportion of splits which yield a half-test score of from a to b inclusive, given a total score of X. The derivation of this formula may be found in the appendix.
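A minimal sketch of formula (3) follows. The function names `phi` and `coefficient_beta` are hypothetical, and the code assumes an even number of items and integer total scores between 0 and n; it is an illustration of the four sums, not the author's program.

```python
from collections import Counter
from math import comb

def phi(x, a, b, n):
    """Proportion of labeled half-splits giving a first-half score from a to b
    inclusive, given total score x on n items (n even)."""
    half = n // 2
    total = 0
    for j in range(a, b + 1):
        if 0 <= j <= x and 0 <= half - j <= n - x:
            total += comb(x, j) * comb(n - x, half - j)
    return total / comb(n, half)

def coefficient_beta(scores, n, k):
    """Formula (3): mean split-half coefficient of agreement for an n-item
    test (n even) with half-test cut-off k, from the total scores alone."""
    f = Counter(scores)                       # f[x] = frequency of total score x
    half = n // 2
    total = 0.0
    for x, fx in f.items():
        if x <= k - 1 or x >= half + k:       # always nonmastery / always mastery
            total += fx
        elif x <= 2 * k - 2:                  # consistent when both halves < k
            total += fx * phi(x, x - (k - 1), k - 1, n)
        elif x >= 2 * k:                      # consistent when both halves >= k
            total += fx * phi(x, k, x - k, n)
        # x == 2k - 1 contributes nothing
    return total / len(scores)

# Two-item test, k = 1: scores 0 and 2 are always consistent, score 1 never is.
print(coefficient_beta([0, 1, 2], n=2, k=1))
```

Working from the frequency table means each value of the split proportion is computed once per distinct score rather than once per examinee.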

Some examples

A formula of this complexity is more suited to a computer than to hand calculation, but some examples may clarify matters. To keep computations relatively simple, the following cases are for 8 items, 10 examinees, and a criterion level of 75%, yielding a cut-off score of 6 (and hence k = 3).

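Example 1. Consider the total score vector X = (1,3,4,5,5,6,6,6,7,8). Note that this score distribution is unimodal, with most scores near the cut-off, and that half the examinees are classified "mastery" and half "nonmastery." This is not the kind of score distribution one would normally hope for on a CRT. Then

$\beta = \frac{1}{10}\left[(f_0+f_1+f_2) + \left(f_3\,\phi_3(1,2) + f_4\,\phi_4(2,2)\right) + f_6\,\phi_6(3,3) + (f_7+f_8)\right]$

$= \frac{1}{10}\left[(0+1+0) + \left(1\cdot\frac{\binom{3}{1}\binom{5}{3}+\binom{3}{2}\binom{5}{2}}{\binom{8}{4}} + 1\cdot\frac{\binom{4}{2}\binom{4}{2}}{\binom{8}{4}}\right) + 3\cdot\frac{\binom{6}{3}\binom{2}{1}}{\binom{8}{4}} + (1+1)\right]$

$= \frac{1}{10}\left[1 + \frac{3\cdot 10}{70} + \frac{3\cdot 10}{70} + \frac{6\cdot 6}{70} + 3\left(\frac{20\cdot 2}{70}\right) + 2\right]$

$= \frac{1}{10}\left[1 + .429 + .429 + .514 + 3(.571) + 2\right]$

$= \frac{1}{10}\,[6.085]$

$= .61$, approximately.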



Example 2. X = (0,1,1,2,3,6,7,7,8,8). This score distribution is bimodal, with a gap separating the scores of the half labeled "masters" from the half labeled "nonmasters." This more closely represents the kind of score distribution one would look for when administering a test designed to separate the examinees into two groups. Here,

$\beta = \frac{1}{10}\left[(f_0+f_1+f_2) + f_3\,\phi_3(1,2) + f_6\,\phi_6(3,3) + (f_7+f_8)\right]$   (note that $f_4 = f_5 = 0$)

$= \frac{1}{10}\left[(1+2+1) + 1\cdot\frac{\binom{3}{1}\binom{5}{3}+\binom{3}{2}\binom{5}{2}}{\binom{8}{4}} + 1\cdot\frac{\binom{6}{3}\binom{2}{1}}{\binom{8}{4}} + (2+2)\right]$

$= \frac{1}{10}\left[4 + \frac{3\cdot 10}{70} + \frac{3\cdot 10}{70} + \frac{20\cdot 2}{70} + 4\right]$

$= \frac{1}{10}\left[4 + .429 + .429 + .571 + 4\right]$

$= \frac{1}{10}\,[9.429]$

$= .94$, approximately.
Example 3. Lest one be tempted to attribute the difference in values of β to the fact that the variance of the second score distribution is about two and a half times as large as that of the first example, it should be pointed out that the magnitude of β does not rely on score variance. Thus, for X = (6,6,7,7,7,8,8,8,8,8), where all examinees are classified "masters" and where the score variance is only about a sixth as large as that of the first example, β = .91.
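The three examples can be cross-checked without the computing formula by enumerating every labeled split directly. The sketch below is such a brute-force check, assuming that each examinee's correct answers may be placed on arbitrary items (only total scores are given, and the count of agreeing splits does not depend on the placement); the function name is hypothetical.

```python
from itertools import combinations

def beta_by_enumeration(scores, n, k):
    """Average, over every labeled half-split, of the proportion of examinees
    classified the same way (both halves >= k, or both < k)."""
    # Only total scores are given, so place each examinee's correct answers on
    # the first items; any other placement gives the same count of agreements.
    responses = [[1] * x + [0] * (n - x) for x in scores]
    agreements = 0
    splits = 0
    for first_half in combinations(range(n), n // 2):
        splits += 1
        for row in responses:
            s1 = sum(row[i] for i in first_half)   # score on the first half
            s2 = sum(row) - s1                     # score on the second half
            if (s1 >= k) == (s2 >= k):
                agreements += 1
    return agreements / (splits * len(scores))

examples = {
    "Example 1": [1, 3, 4, 5, 5, 6, 6, 6, 7, 8],
    "Example 2": [0, 1, 1, 2, 3, 6, 7, 7, 8, 8],
    "Example 3": [6, 6, 7, 7, 7, 8, 8, 8, 8, 8],
}
for name, x in examples.items():
    print(name, round(beta_by_enumeration(x, n=8, k=3), 2))  # 0.61, 0.94, 0.91
```

Enumeration visits all 70 splits of the 8 items, so it is practical only for short tests; the computing formula avoids this cost.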

Adjustment for odd number of items

Thus far it has been assumed that the test has an even number of items. If n is odd, a test split is defined as resulting when one item is deleted and the remaining n-1 (even) items are divided into two sets, each containing $\frac{n-1}{2}$ items. The procedure for computing β is identical for even and odd n, except that in the latter case we first perform an additional step, for reasons explained in the appendix, replacing $f_X$ by

$\frac{(n - X)\,f_X + (X + 1)\,f_{X+1}}{n}$   for X = 0, 1, ..., n-1,

and then using n-1 in place of n in the computations of k and $\phi_X(a,b)$.
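The additional step for odd n amounts to a transformation of the frequency table. The sketch below applies it to a hypothetical 3-item test; the function name and data are assumptions made for illustration.

```python
def adjust_frequencies_for_odd_n(f, n):
    """Return g[0..n-1], the score frequencies on n-1 items after one item is
    deleted, given f[0..n], the frequencies of total scores on n items."""
    if n % 2 == 0 or len(f) != n + 1:
        raise ValueError("expects odd n and a frequency list of length n + 1")
    return [((n - x) * f[x] + (x + 1) * f[x + 1]) / n for x in range(n)]

# Hypothetical 3-item test: one examinee scores 0, one scores 1, one scores 3.
f = [1, 1, 0, 1]
g = adjust_frequencies_for_odd_n(f, n=3)
print(g)        # fractional frequencies for scores 0, 1, 2 on the remaining 2 items
print(sum(g))   # the transformed frequencies still sum to the number of examinees
```

The transformed frequencies are generally fractional, since each examinee is spread over the n possible choices of the deleted item.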
Properties of Coefficient β

1. Coefficient β is additive; it is the mean of its component parts. Thus each person's score makes a contribution to the value of β. Moreover, it is apparent from Equation (3) and from the analysis given in the appendix that as a score approaches the point 2k-1 (where k is the half-test cut-off score, as defined for Equation (3)), it contributes successively less to the value of β; a score of 2k-1 contributes zero. If C represents the (integral) cut-off score, 2k-1 is either C or (as in this example) C-1. (See Marshall (in press) for a more thorough discussion.) What this means is that as scores depart from the cut-off, the value of β increases, a fact that is consonant with the notion that β measures consistency of dichotomous classification.

2. Coefficient β is variance-free in the respect deemed most important by critics of a variance-dependent CRT reliability coefficient: β can take on its full range of [0, 1] even though the total score variance is zero, depending on the relative locations of the cut-off score and the (single-membered) set of test scores. It is, however, variance-dependent in other respects. As the variance approaches its maximum, coefficient beta approaches 1, which is reassuring since maximum variance obtains only when scores on an n-item test are equally divided between 0 and n, which scores indicate the clearest possible separation into masters and nonmasters. Furthermore, if β is zero, then variance is zero. These facts can be easily summarized: if variance is high, β is high; if variance is low, there is no restriction (within its range) on β.

3. For a given test type and criterion level, the value of β is not affected by the number of examinees.

4. For a given test type and criterion level, the value of β is, however, affected by the number of items: β increases as the number of items increases. A study using simulated data (Marshall, in press) indicates that although shape of score distribution also has some effect, one can prophesy reasonably well the value of β for a test twice as long via the formula

$\beta_{2n} = \frac{\beta_n\,(3 + \beta_n)}{2\,(1 + \beta_n)}.$

This formula, arrived at on purely empirical grounds, is the arithmetic mean of the values obtained from the Spearman-Brown prophecy formula $f(\beta) = 2\beta/(1+\beta)$ and the prediction $f(\beta) = \beta$.
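The statement that the doubling formula $\beta_n(3+\beta_n)/(2(1+\beta_n))$ is the arithmetic mean of the Spearman-Brown value and the unchanged value can be checked numerically. The function names below are hypothetical, and the input .61 is simply the value from Example 1 reused as an illustration.

```python
def spearman_brown_doubled(r):
    """Classical Spearman-Brown prediction for a test of double length."""
    return 2 * r / (1 + r)

def predicted_beta_doubled(beta_n):
    """Empirical prediction of beta for a test twice as long:
    beta_n * (3 + beta_n) / (2 * (1 + beta_n))."""
    return beta_n * (3 + beta_n) / (2 * (1 + beta_n))

beta_n = 0.61  # value from Example 1, used only as an illustration
as_mean = (spearman_brown_doubled(beta_n) + beta_n) / 2
print(round(predicted_beta_doubled(beta_n), 3), round(as_mean, 3))  # the two agree
```

Because the prediction averages the Spearman-Brown value with "no change," it always lies below the Spearman-Brown value whenever β is strictly between 0 and 1.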



5. The value of coefficient β is (usually) different for different criterion levels. There are a total of n/2 meaningful criterion levels for an n-item test, since formula (3) utilizes k, the cut-off score on a half-test. As criterion level, expressed as a fraction, approaches its meaningful limits of 2/n or 1, β generally tends toward 1, particularly for symmetric unimodal distributions.

Relations with other test indices

All results reported in this section are based on simulated data and are treated in more detail in a forthcoming report (Marshall, in press).

1. There seems to be a fairly high correlation across test types between α (i.e., KR-20) and $\bar{\beta}$, the mean value of β over criterion levels.

2. There is little if any connection between β and the index of efficiency $\mu_c^2$ (Harris, 1972), except that for unimodal score distributions the fluctuations of the two indices over criterion level seem to be opposite in direction.

3. For criterion levels most likely to be used in an actual test (i.e., from .6 to .9), β has a moderate correspondence with the phi coefficient, when the phi coefficient is calculated from a four-fold contingency table whose cells are the means of all possible split-half classifications, under which conditions the phi coefficient can be construed to be a single-administration index. Under these same conditions, phi is identical to Cohen's kappa when calculated from the same table. This coefficient is in turn a close lower bound to the mean of all possible split-half kappa coefficients.
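The identity of phi and kappa on the averaged table can be illustrated numerically. Averaging over labeled splits divides the inconsistent classifications equally between cells B and C, so the averaged table has B = C; the sketch below assumes such a table with hypothetical cell values.

```python
from math import sqrt, isclose

def phi_coefficient(a, b, c, d):
    """Phi for a four-fold table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def cohen_kappa(a, b, c, d):
    """Cohen's kappa for the same four-fold table."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical averaged table; averaging over labeled splits makes b equal c.
a, b, c, d = 5.0, 2.0, 2.0, 3.0
print(phi_coefficient(a, b, c, d), cohen_kappa(a, b, c, d))
print(isclose(phi_coefficient(a, b, c, d), cohen_kappa(a, b, c, d)))  # True
```

With B and C unequal the row and column marginals differ and the two indices generally part company, which is why the equality is tied to the averaged table.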

4. For tests yielding unimodal score distributions, β seems to be measuring much the same thing as does $k_{tx}^2$ (Livingston, 1972). For these unimodal distributions, both β and $k_{tx}^2$ have somewhat similar ranges of values and patterns of fluctuation over criterion level. However, and just as important, this close relationship does not hold for bimodal distributions. The reason is that β is sensitive to (has minima, over criterion levels, near) the mode(s) of the score distribution, whereas $k_{tx}^2$ is sensitive to (has its minimum, over criterion levels, at) the test mean. In a unimodal distribution the mean and mode are usually proximate and the effect is the same; this is, of course, not generally the case for a bimodal score distribution. Figure 1 shows the fluctuations (over criterion level) of β and $k_{tx}^2$ for a unimodal and a bimodal distribution, and shows the patterns described above.

Discussion

Although attention in this paper has been focused on criterion-referenced tests, it should be pointed out that coefficient beta is applicable any time that it makes sense to look at reliability as consistency of classification or consistency of decision-making based on scores from a measuring instrument, provided that the decision is based on some sort of cut-off point expressible as a percent of items responded to in a certain manner.

Second, coefficient beta can be used as a tool to help a criterion-referenced test developer search for the cutting score which best separates a population into two classifications. The procedure would be, given the test score distribution on a large, representative sample of the population, to calculate coefficient beta at all of the meaningful criterion levels which fall within a predetermined "acceptable" range (e.g. [.7, .9]), and then select that criterion level which yields the highest coefficient beta.
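The search procedure can be sketched as follows, assuming an even number of items and taking the meaningful criterion levels to be the fractions 2k/n for k = 1, ..., n/2. The function names, the percent-based range arguments, and the tryout scores are all hypothetical.

```python
from math import comb

def beta(scores, n, k):
    """Mean split-half coefficient of agreement (n even, half-test cut-off k)."""
    half = n // 2
    total = 0
    for x in scores:
        for j in range(max(0, x - half), min(x, half) + 1):
            if (j >= k) == (x - j >= k):          # both halves classified alike
                total += comb(x, j) * comb(n - x, half - j)
    return total / (comb(n, half) * len(scores))

def best_criterion_level(scores, n, low_pct, high_pct):
    """Among the meaningful criterion levels 2k/n (k = 1..n/2) that lie within
    [low_pct, high_pct] percent, return (level, beta) with the highest beta."""
    candidates = [k for k in range(1, n // 2 + 1)
                  if low_pct * n <= 200 * k <= high_pct * n]
    best_k = max(candidates, key=lambda k: beta(scores, n, k))
    return 2 * best_k / n, beta(scores, n, best_k)

# Hypothetical tryout data on a 20-item test.
scores = [3, 5, 9, 11, 12, 13, 15, 17, 18, 19, 20, 20]
print(best_criterion_level(scores, n=20, low_pct=70, high_pct=90))
```

When several levels tie, this sketch returns the lowest of them; a developer might prefer some other tie-breaking rule.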

Third, mention should be made of the fact that if students respond randomly to the answers on a test, the resulting coefficient beta would not be zero, as might be expected with a traditional reliability measure. In fact, depending on the number of items, the criterion level, and the number of options per item (assuming a multiple-choice test), coefficient beta could take on a rather high value, possibly even 1. From a traditional test theory standpoint, this is disconcerting. Yet, looked at from a CRT point of view, it is understandable: for if all examinees respond randomly to a test, that is a clear indication that they are about as far from mastery as is possible; the high value of coefficient beta is an indicant that the test is classifying them as such, and reliably so. Nonetheless, a test constructor might want additional test tryout information before passing judgment about the instrument's reliability, as would be the case in the construction of an NRT.
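To give a rough sense of how high coefficient beta can be under random responding, the sketch below computes its expected value when every examinee guesses on every item. It assumes that total scores are then binomial with success probability equal to one over the number of options; the function names and the 8-item, four-option setting are hypothetical.

```python
from math import comb

def consistent_proportion(x, n, k):
    """Proportion of labeled half-splits classifying a total score x the same
    way on both halves (n even, half-test cut-off k)."""
    half = n // 2
    agree = sum(comb(x, j) * comb(n - x, half - j)
                for j in range(max(0, x - half), min(x, half) + 1)
                if (j >= k) == (x - j >= k))
    return agree / comb(n, half)

def expected_beta_under_guessing(n, k, options):
    """Expected beta when every examinee guesses on every item, so that total
    scores are binomial with success probability 1 / options."""
    p = 1 / options
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) * consistent_proportion(x, n, k)
               for x in range(n + 1))

# 8 items, 75% criterion (k = 3), four options per item.
print(round(expected_beta_under_guessing(n=8, k=3, options=4), 2))
```

In this setting most guessed scores fall well below the cut-off, so nearly every split classifies the examinee as a nonmaster on both halves.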

Fourth, this paper has been concerned only with tests which result in a dichotomous classification, whereas some commercial programs prefer to have available a middle classification as well. It is shown in the appendix that coefficient beta can be extended to encompass such a trichotomous classification situation.
Analysis and derivation of coefficient beta,
adjustment for tests with odd number of items,
and extension of the coefficient to trichotomous data
Definitions

Let

N = the number of examinees;
n = the number of test items;
$X_p$ = the pth person's total score, p = 1,...,N;
c = the criterion level, expressed as a fraction ($0 < c \le 1$);
k = the smallest integer $\ge \frac{cn}{2}$, and hence the minimum number of items in a half-test that must be answered correctly to receive a mastery classification on that half-test; and
$X_{1p}, X_{2p}$ = the pth person's scores within the two half-tests, and hence $X_{1p} + X_{2p} = X_p$.

There are $v = \binom{n}{n/2}$ possible split-halves of the n items, if one considers each half to be labeled (i.e., for a two-item test the split 1 / 2 is different from 2 / 1). For each pair of split-halves,* construct a four-fold mastery (+) / nonmastery (-) contingency table

            +     -
      +     A     B
      -     C     D
                        N

and define

$P = \frac{A + D}{N}.$

*For now, only tests with an even number of items are considered. Tests with an odd number of items are dealt with later.
Then β is the mean of P taken over all v possible test splits (s):

(A.1)   $\beta = \frac{1}{v}\sum_{s=1}^{v} P_s = \frac{1}{v}\sum_{s=1}^{v}\frac{A_s + D_s}{N} = \frac{1}{N}\left(\frac{\sum_{s} A_s + \sum_{s} D_s}{v}\right).$

Analysis of the coefficient

For any given test, the set of possible scores for an individual is {0,1,...,n}. For computational purposes, this is partitioned into five subsets, one or more of which may be empty for a particular n and k:

$S_1 = \{0,\ldots,k-1\}$
$S_2 = \{k,\ldots,2k-2\}$
$S_3 = \{2k-1\}$
$S_4 = \{2k,\ldots,\tfrac{n}{2}+k-1\}$
$S_5 = \{\tfrac{n}{2}+k,\ldots,n\}.$

(Note that k = 1 implies $S_2 = \{\,\}$, and k = n/2 implies $S_4 = \{\,\}$.)
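The five subsets are easy to list for a particular n and k. The sketch below does so for the 8-item, k = 3 setting used in the examples and confirms the two empty-subset cases; the function name is hypothetical.

```python
def score_partition(n, k):
    """Split the possible total scores 0..n into the five subsets S1..S5
    for an n-item test (n even) with half-test cut-off k (1 <= k <= n/2)."""
    half = n // 2
    return {
        "S1": list(range(0, k)),              # 0 .. k-1
        "S2": list(range(k, 2 * k - 1)),      # k .. 2k-2
        "S3": [2 * k - 1],                    # the single score 2k-1
        "S4": list(range(2 * k, half + k)),   # 2k .. n/2+k-1
        "S5": list(range(half + k, n + 1)),   # n/2+k .. n
    }

print(score_partition(8, 3))
print(score_partition(8, 1)["S2"], score_partition(8, 4)["S4"])  # both empty
```

Together the five lists always cover every score from 0 to n exactly once.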

Then consider scores in each of the five subsets: 

1. For $X_p \in S_1$, $X_p < k$. Thus mastery on a half-test cannot be obtained no matter how the test is split, since both $X_{1p}$ and $X_{2p}$ must necessarily be less than k. Hence all persons with $X_p \in S_1$ will contribute to D, as defined in the contingency table above, for all v test splits.
2. For $X_p \in S_2$, $k \le X_p \le 2k-2$. Here some splits will contribute to B or C (for example, $X_p = k+1$; $X_{1p} = k$, $X_{2p} = 1$) and some will contribute to D (for example, $X_p = 2k-2$; $X_{1p} = X_{2p} = k-1$). The obvious question "Which splits?" becomes a problem of combinatorics. Since only A and D enter into Equation A.1, one need not be concerned with contributions to B and C. (These will be equally divided among B and C because of the symmetry implied in "labelling" the halves of the test.)

The question then reduces to "For a score of $X_p \in S_2$, how many D-categorizations will result?" This will happen when neither half-test is mastered, i.e. when both $X_{1p}, X_{2p} \le k-1$.

Define $\mathbf{X}_{1p}$ and $\mathbf{X}_{2p}$ as vectors of 0's and 1's indicating incorrect/correct responses to items on each half-test. If one vector has k-1 1's, the other has $X_p-(k-1)$ 1's. Moreover, since $X_p \in S_2$ and hence $X_p \le 2k-2$, it follows that $X_p-(k-1) \le k-1$. Thus one is interested only in those pairs of vectors in which the number of 1's in each is between these two limits, namely $X_p-(k-1) \le$ both $X_{1p}, X_{2p} \le k-1$.

Moreover, since in the total score there are $X_p$ 1's, there are $n-X_p$ 0's. In the half-score, if there are j 1's, there are n/2 - j 0's. Thus, for $X_p \in S_2$, we can pick pairs of vectors which will yield D-categorizations in

$\sum_{j=X_p-(k-1)}^{k-1}\binom{X_p}{j}\binom{n-X_p}{n/2-j}$ ways.

3. For $X_p \in S_3$, $X_p = 2k-1$. Thus the most "balanced" split will yield k 1's in one vector and k-1 1's in the other, indicating mastery in the first case and nonmastery in the second. Other, less "balanced" splits will yield more extreme allocations of 1's, resulting in the same mastery/nonmastery classification. Thus, for all $X_p \in S_3$, no split contributes to A or D.

4. For $X_p \in S_4$, $2k \le X_p \le \frac{n}{2} + k-1$. This case is similar to that of $S_2$. Some splits will contribute to B or C (for example, $X_p = 2k$; $X_{1p} = k+1$, $X_{2p} = k-1$) and some to A (for example, $X_p = 2k$; $X_{1p} = X_{2p} = k$). Since $X_p \ge 2k$, it cannot be that both $X_{1p}, X_{2p} < k$, and hence there are no contributions to D. Again we ignore the contributions to B and C, this time focusing attention on the contributions to A.

In this case, one needs to count those vectors such that both half-tests are mastered, i.e. where both $X_{1p}, X_{2p} \ge k$. If one half-test vector has k 1's, the other has $X_p-k$ 1's. But $X_p \in S_4$ implies $X_p \ge 2k$, which implies $k \le X_p - k$. Thus one is interested only in those half-test vectors such that $k \le$ both $X_{1p}, X_{2p} \le X_p-k$. Using reasoning identical to that of case $S_2$, the total number of splits which will contribute to A for $X_p \in S_4$ is

$\sum_{j=k}^{X_p-k}\binom{X_p}{j}\binom{n-X_p}{n/2-j}.$

5. For $X_p \in S_5$, $X_p \ge \frac{n}{2} + k$. This says that half the items plus at least another k items are answered correctly, and thus both $X_{1p}, X_{2p} \ge k$ no matter how the test is split. Hence all v splits contribute to A.
The coefficient

The above analysis yields an equation for β, the mean split-half coefficient of agreement. For $X_p$ in each of the five subsets, define the following functions $\phi_i(X)$, i = 1,...,5:

1. for $0 \le X \le k-1$:   $\phi_1(X) = 1$

2. for $k \le X \le 2k-2$:   $\phi_2(X) = \sum_{j=X-(k-1)}^{k-1}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}$

3. for $X = 2k-1$:   $\phi_3(X) = 0$

4. for $2k \le X \le \frac{n}{2}+k-1$:   $\phi_4(X) = \sum_{j=k}^{X-k}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}$

5. for $\frac{n}{2}+k \le X \le n$:   $\phi_5(X) = 1$

Here, $\phi_i(X)$ is the proportion of splits which contribute to A or D for a given score X.

Then Equation A.1 can be rewritten

(A.2)   $\beta = \frac{1}{N}\sum_{p=1}^{N}\phi_i(X_p)$

where the index i depends on the value of $X_p$. Hence β has range [0,1]; it is 0 when all $X_p \in S_3$; 1 when all $X_p \in S_1 \cup S_5$.

Although Equation A.2 sums up the analysis rather simply, it is inefficient for computing purposes. A more efficient method is to generate a frequency distribution of total scores, and compute $\phi_i(X)$ only once for each possible value. In general, let $f_X$ be the frequency of score X, X = 0,...,n, with $\sum_{X=0}^{n} f_X = N$. Then

$\beta = \frac{1}{N}\sum_{X=0}^{n} f_X\,\phi_i(X)$

where again the index i depends on the value of X.
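More explicitly, since for some values of X, $\phi_i(X)$ = 0 or 1,

$\beta = \frac{1}{N}\left[\sum_{X=0}^{k-1} f_X + \sum_{X=k}^{2k-2} f_X\,\phi_X(X-[k-1],\,k-1) + \sum_{X=2k}^{n/2+k-1} f_X\,\phi_X(k,\,X-k) + \sum_{X=n/2+k}^{n} f_X\right]$

where

$\phi_X(a,b) = \sum_{j=a}^{b}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}.$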

Adjustment for odd n

For an odd number of items, a test split is defined as resulting when one item is deleted and the remaining items are divided into two sets, each containing $\frac{n-1}{2}$ items. In this case, k is the smallest integer $\ge \frac{c(n-1)}{2}$. The item deleted may be chosen in n ways, each yielding a distinct set of n-1 items to be split. Hence there are $n\binom{n-1}{(n-1)/2}$ possible split halves, if one again considers each half to be labeled.

For person p, with total score $X_p$, the response vector $\mathbf{X}_p$ has $X_p$ 1's and $n-X_p$ 0's. Thus, for person p, $X_p$ of the n possible choices of the deleted item will result in a set of n-1 items containing $X_p-1$ 1's, and $n-X_p$ choices will result in a set containing $X_p$ 1's. Thus the contribution to coefficient β for person p, rather than $\phi_i(X_p)$, will be $\frac{X_p}{n}\,\phi_i(X_p-1) + \frac{n-X_p}{n}\,\phi_i(X_p)$, and hence, taking the mean over persons,

$\beta = \frac{1}{nN}\sum_{p=1}^{N}\left[X_p\,\phi_i(X_p-1) + (n-X_p)\,\phi_i(X_p)\right].$

As before, it is necessary to compute $\phi_i(X)$ only once for each possible value of $X_p$.

But also as before, the computation is more efficient if we utilize the frequency distribution of total scores. Recall that for a score of $X_p$ on n (odd) items, for $n-X_p$ choices of the item deleted the total score on n-1 items will remain at $X_p$, and for $X_p$ choices the total score on n-1 items will be reduced to $X_p-1$. The effect is that of a transformation, →, on the set of total scores. In symbols,

X → X in $\frac{n-X}{n}$ of the cases;
X → X-1 in $\frac{X}{n}$ of the cases, and hence
X+1 → X in $\frac{X+1}{n}$ of the cases.

Hence, a total score of X is arrived at with frequency

$g(X) = \frac{n-X}{n}\,f_X + \frac{X+1}{n}\,f_{X+1}.$

(Note that, since $f_{n+1} = 0$, $g(n) = \frac{0}{n}\,f_n + \frac{n+1}{n}\,f_{n+1} = 0$, and therefore $\sum_{X=0}^{n-1} g(X) = \sum_{X=0}^{n} g(X)$. Furthermore, it is easily shown that $\sum_{X=0}^{n-1} g(X) = \sum_{X=0}^{n} f_X$.) Thus, taking the mean over the transformed frequency distribution of total scores, coefficient beta is
$\beta = \frac{1}{N}\sum_{X=0}^{n-1} g(X)\,\phi_i(X)$

where once again the index i depends on the value of X. Thus, in practice, the computation of β is identical for the cases of even and odd n, except that in the latter case one first performs an additional step, replacing $f_X$ by $\frac{(n-X)\,f_X + (X+1)\,f_{X+1}}{n}$, X = 0,1,...,n-1, and then using n-1 in place of n in the computations of k and $\phi_i(X)$.

Coefficient beta and trichotomous data

The authors of some commercial instructional programs, such as Developing Mathematical Processes (DMP Staff, 1974), contend that mastery/nonmastery alone is not a sufficient categorization of test results, and that more valuable information and more appropriate teacher options become available if the test result data are trichotomized. Coefficient beta, as outlined above, is clearly not sensitive to such a trichotomization scheme.

The trichotomous coefficient of agreement in such a situation would be equal to

$\frac{A + E + I}{N}$

based on the table

            +     *     -
      +     A     B     C
      *     D     E     F
      -     G     H     I
                              N

where the symbols +, *, - stand for the three categorizations. If a coefficient analogous to β were to be applicable to this sort of set-up, it should be equal to $\frac{1}{v}\sum_{s=1}^{v}\frac{A_s + E_s + I_s}{N}$, or the mean split-half trichotomous coefficient of agreement.

As it turns out, such a coefficient can be derived, although the derivation is not presented here. The analysis of this coefficient, although more complex in places than that for β, is essentially parallel to that presented earlier. Instead of partitioning the set {0,...,n} into five subsets, one partitions it into seven. Recall that for coefficient β, k is the minimum number of items on a half-test that must be answered correctly in order to receive a mastery classification. If, for trichotomized data, one in addition lets ℓ be the minimum number of items on the half-test that must be answered correctly in order to receive the middle classification, then the seven subsets of {0,...,n}, together with their corresponding values of $\phi_i(X)$, i = 1,...,7, are

$S_1 = \{0,\ldots,\ell-1\}$:   $\phi_1(X) = 1$

$S_2 = \{\ell,\ldots,2\ell-2\}$:   $\phi_2(X) = \sum_{j=X-(\ell-1)}^{\ell-1}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}$

$S_3 = \{2\ell-1\}$:   $\phi_3(X) = 0$

$S_4 = \{2\ell,\ldots,2k-2\}$:   $\phi_4(X) = \sum_{j=u_1}^{u_2}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}$

$S_5 = \{2k-1\}$:   $\phi_5(X) = 0$

$S_6 = \{2k,\ldots,\tfrac{n}{2}+k-1\}$:   $\phi_6(X) = \sum_{j=k}^{X-k}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}$

$S_7 = \{\tfrac{n}{2}+k,\ldots,n\}$:   $\phi_7(X) = 1$

where $0 < \ell < k \le \frac{n}{2}$, and

$u_1 = \max(\ell,\ X-[k-1]),$
$u_2 = \min(k-1,\ X-\ell).$

Note that ℓ = 1 implies $S_2 = \{\,\}$ and k = n/2 implies $S_6 = \{\,\}$.

As before, the computation is made more efficient by utilizing the frequency distribution of total scores, and hence a formula for $\beta_3$, the mean split-half trichotomous coefficient of agreement, is

$\beta_3 = \frac{1}{N}\sum_{X=0}^{n} f_X\,\phi_i(X).$

Since $\phi_i(X)$ is 0 or 1 in four of the seven cases, this can be more explicitly rewritten

$\beta_3 = \frac{1}{N}\left[\sum_{X=0}^{\ell-1} f_X + \sum_{X=\ell}^{2\ell-2} f_X\,\phi_X(X-[\ell-1],\,\ell-1) + \sum_{X=2\ell}^{2k-2} f_X\,\phi_X(u_1,\,u_2) + \sum_{X=2k}^{n/2+k-1} f_X\,\phi_X(k,\,X-k) + \sum_{X=n/2+k}^{n} f_X\right]$

where

$\phi_X(a,b) = \sum_{j=a}^{b}\binom{X}{j}\binom{n-X}{n/2-j}\Big/\binom{n}{n/2}$

and $u_1$ and $u_2$ are as above.

The trichotomous coefficient incorporates the same adjustments for an odd number of items as does the dichotomous coefficient, except that n-1 is used in calculating ℓ as well as k and $\phi_i(X)$.
Note that if the test is multiple choice, the lower of the two criterion levels should not be set near the percent of items which should be answered correctly due to chance, as this would result in unreliable classification decisions between the lower two categories. In this case, if there are a significant number of nonmasters in the population, the value of $\beta_3$ would tend to be rather low, as would be expected.
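The trichotomous coefficient can also be sketched by direct counting rather than through the seven subsets: for each total score, count the labeled splits on which both half-test scores fall in the same one of the three categories. The function names and the 8-item data are hypothetical, the lower half-test cut-off is written `ell`, and the code assumes an even number of items.

```python
from math import comb

def category(half_score, ell, k):
    """0 = lowest, 1 = middle, 2 = mastery, for half-test cut-offs ell < k."""
    return (half_score >= ell) + (half_score >= k)

def beta_trichotomous(scores, n, ell, k):
    """Mean split-half trichotomous coefficient of agreement (n even),
    counting for each total score the labeled splits on which both halves
    fall in the same one of the three categories."""
    half = n // 2
    agree = 0
    for x in scores:
        for j in range(max(0, x - half), min(x, half) + 1):  # j = first-half score
            if category(j, ell, k) == category(x - j, ell, k):
                agree += comb(x, j) * comb(n - x, half - j)
    return agree / (comb(n, half) * len(scores))

# Hypothetical 8-item test with half-test cut-offs ell = 2 and k = 3.
print(beta_trichotomous([0, 4, 5, 8], n=8, ell=2, k=3))
```

Here the scores 0 and 8 are always classified consistently, a score of 5 (one below twice the upper cut-off) never is, and a score of 4 is consistent only when both halves land in the middle category.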